Unsupervised multimodal modeling of cognitive and brain health trajectories for early dementia prediction

Predicting the course of neurodegenerative disorders early has potential to greatly improve clinical management and patient outcomes. A key challenge for early prediction in real-world clinical settings is the lack of labeled data (i.e., clinical diagnosis). In contrast to supervised classification approaches that require labeled data, we propose an unsupervised multimodal trajectory modeling (MTM) approach based on a mixture of state space models that captures changes in longitudinal data (i.e., trajectories) and stratifies individuals without using clinical diagnosis for model training. MTM learns the relationship between states comprising expensive, invasive biomarkers (β-amyloid, grey matter density) and readily obtainable cognitive observations. MTM training on trajectories stratifies individuals into clinically meaningful clusters more reliably than MTM training on baseline data alone and is robust to missing data (i.e., cognitive data alone or single assessments). Extracting an individualized cognitive health index (i.e., MTM-derived cluster membership index) allows us to predict progression to AD more precisely than standard clinical assessments (i.e., cognitive tests or MRI scans alone). Importantly, MTM generalizes successfully from research cohort to real-world clinical data from memory clinic patients with missing data, enhancing the clinical utility of our approach. Thus, our multimodal trajectory modeling approach provides a cost-effective and non-invasive tool for early dementia prediction without labeled data (i.e., clinical diagnosis) with strong potential for translation to clinical practice.

was collected within 6 months of the exam date.Trajectories were formed from data instances for each participant by looking for available data within a 6-month window of baseline + 2 years, baseline + 4 years, etc., where baseline corresponds to the first instance where cognitive scores, grey matter density, and β-amyloid were available for a person.Clinical diagnoses based on individuals' final assessments were as follows: cognitively normal (CN: 234), stable MCI (sMCI: 224), progressive MCI (pMCI: 19), and Alzheimer's disease (AD: 94).Patients were identified as pMCI if they progressed from MCI to AD within a period of 3 years, while sMCI patients remained diagnosed as MCI for the same period.For quality control, individuals whose diagnosis switched from AD to MCI were removed from the dataset.
Additionally, individuals whose standardized grey matter density score increased by 25% or more from one visit to the next were removed from the dataset.Demographics (Table S1) at baseline were as follows: the average age was 72.3 years (± std. of 6.9), the average educational attainment was 16.4 years (± std. of 2.6), and 48% of participants were female.40% of individuals were identified as APOE4 positive (i.e., they had one or two ε4 alleles) and the remainder were APOE4 negative.
Structural MRI scans (T1) for ADNI participants were acquired across ADNI sites as described online (see: https://adni.loni.usc.edu/methods/documents/mri-protocols).T1 scans were processed to extract grey matter density scores from medial temporal cortex following the method described by Giorgio et al [6].T1 scans were pre-processed with SPM 12 [7] to segment them into grey matter, white matter, and cerebrospinal fluid.They were normalized to a study-specific template using the DARTEL toolbox [8] and then individual grey matter segmentation volumes were normalized to MNI space without modulation.At each voxel location, unmodulated values represent grey matter density.
The images were then smoothed with a 3mm3 isotropic kernel and resliced to MNI resolution 1.5mm × 1.5mm × 1.5mm voxel size.We then calculated a previously-validated disease-specific biomarker using the processed images.This biomarker was formed by applying partial least squares regression [9, §12.5.2] with recursive feature elimination to create orthogonal features from T1-weighted MRI voxels having maximum covariance with MMSE score (in this way, the scores were also normalized at the voxel level, so values are often negative).In this context, recursive feature elimination was used to iteratively remove the least relevant voxels.As the resulting value is density-and not volume-based, differences in brain size do not impact the biomarker score.
PET imaging with the Florbetapir ([18F]AV45, FBP) radiotracer was performed at each ADNI site using methods as described in the respective protocols.FBP data were realigned and co-registration was performed to each participant's structural MRI using the mean of all frames.Cortical standardized uptake value ratios (SUVRs) were generated by averaging FBP retention in a standard group of ROIs defined by FreeSurfer v5.3 [10] and then dividing by the average uptake in the whole cerebellum.These SUVRs were converted to centiloid [11] scale with the transformation  ↦ → 196.9 •  − 196.03.

MACC.
We collated longitudinal data with help from the Memory Ageing & Cognition Centre at the National University of Singapore.Data was collected as part of both the Harmonization study [12]- [15] and the ABRI sub-study [16]- [18].The MACC dataset consisted of 157 trajectories containing a single PET measurement (i.e.β-amyloid) and at least one MRI (i.e., grey matter density from medial temporal cortex), MoCA and MMSE scores.Trajectories were formed using available data from study visits: baseline, year 2, year 4. Clinical diagnoses based on individuals' final available assessments were as follows: cognitively normal (CN: 36), mild MCI (50), moderate MCI (19), and Alzheimer's disease (AD: 55).Demographics at baseline were as follows: the average age was 73.0 years (± std. of 7.4), average educational attainment was 7.9 years (± std. of 4.8), and 55% of participants were female.30.4% of individuals were APOE4 positive.T1 MRI scans were acquired with a 3T Magnetom Trio, a Tim system (Siemens) with a 32-channel head coil, at Clinical Imaging Research Center of the National University of Singapore.Grey matter density scores were generated using the same method as for ADNI.PET imaging with the C-labelled Pittsburgh Compound-B ([11C]PiB) was also performed at the Clinical Imaging Research Center.Images were registered to the same space as used by the ADNI study, and regions of interest (ROI) templates for the cortical regions of the frontal, parietal, temporal, occipital, anterior and posterior cingulate, nucleus accumbens, and thalamus [19].Detailed descriptions of the methodology can be found in Nai et al [20].The SUVR values were converted to centiloid scale using the linear map  ↦ → 100(−1.009)/1.067.

Model training
The models learned on ADNI were trained using standard 10-fold cross validation.The data is partitioned uniformly at random into 10 subsets of approximately equal size.For each subset , 1 ≤  ≤ 10, we train a model using data from the 9 other subsets and predict on the th subset.For each fold, we trained the MTM using the data from the 9 other folds ( ≈ 514 trajectories) training on the remaining data ( ≈ 57 trajectories).
Individual models were trained as follows.We start with a training dataset: In an abuse of notation, we allow the length of each sequence to depend on .We adopt a mixture of linear gaussian state space models with   components, as described in (??) and (??): (1) where   ∈ R   ≥0 governs the population-level cluster distribution so that   =1   = 1,   ∈ R ℓ and   ∈ S ℓ parameterise the initial state for cluster ,   ∈ R × and Γ  ∈ S  describe the dynamics of the states, and   ∈ R ×ℓ and Λ  ∈ S ℓ describe the relationship between the states and the measurements.Here, S ℓ denotes the set of valid ℓ × ℓ covariance matrices.
We performed expectation-maximization [EM; 21] with hard cluster assignment to train models.Each individual model was trained from 1000 different initializations of cluster assignment.The first initialization was the output of k-means applied to the initial state in each trajectory.Performing k-means over all hidden states was not possible, as trajectories varied in length.Specifically, we used the scikit-learn [22] instantiation of the k-means++ algorithm [23].In the M-step, for each cluster, we used the trajectories assigned to the cluster and learned the model-specific parameters as follows.The mean mc and covariance Sc of the initial state were set to the empirical mean and covariance.The other cluster-specific parameters were learned with regularized linear regression.The global cluster assignment parameter π was set to the empirical incidence rate of each cluster.The E-step was then performed in the standard manner, assigning each trajectory to the cluster to which it was most likely to belong.Training ran for 1000 steps or until convergence.If none of the hard assignments changed in a given iteration, then EM has successfully completed.After completing training for each of the initializations, we selected the model having the highest expected complete data log likelihood: (2) where Θ = {  ,   ,   , Γ  ,   , Λ  } denotes the cluster-specific parameters for the th cluster and   is 1 when the th datapoint is most likely to belong to the th cluster and 0 otherwise.

Cluster demographics and profiling
Tables S4A and S4B summarize the cluster-wise demographics for ADNI and MACC clusters.

Ablation against an unsupervised model trained on baseline features alone
GMM-baseline.To test the advantage of modeling trajectories compared to cross-sectional snapshots, we trained a Gaussian mixture model on baseline data from each trajectory with scikit-learn [22] from a k-means initialization for each cross-validation fold (on the same folds as used to train the MTM).Labels were assigned to the clusters identified by each of the models in the same manner as for MTM clusters.
The resulting assignments are referred to as GMM-baseline.Further, we compared cluster assignment for GMM-baseline against MTM trained on longitudinal data but tested on baseline data alone.The resulting assignments are referred to as MTM-baseline.

Survival Modeling with Cox Proportional Hazards Models
All Cox proportional hazards models [24] were trained with the lifelines package [25] using Breslow's estimator [26] for the baseline.In every case, the outcome of interest was an AD diagnosis during a person's trajectory, among persons not having an AD diagnosis at baseline.For the univariate analysis, 10-fold cross-validation was used with mild regularization, allowing the concordance index [27] to be calculated on heldout data.

Model marginalization for missing data
A single linear Gaussian state space model is jointly Gaussian in ( 1: ,  1: ), allowing one to exactly and efficiently marginalize over missing variables and missing components of variables.For example, given a new participant having only measurement data, we can compute   ( 1: ) := ( 1: |) for each  to estimate cluster membership for this new participant as where the constant of proportionality depends on  1: and we have Integrating over multivariate Gaussians can be done analytically.This process of marginalizing out unavailable or missing data can also be used during the training process, avoiding data imputation.
In particular, we derive a closed-form expression for a single component of the trajectory model.The joint distribution on the states given by: may be re-expressed as: where each entry in the mean vector is a  × 1 block and each entry in the covariance matrix is a  ×  block.We use  ′ to denote the transpose of the matrix  to avoid any confusion between  and ⊤.If we define the diagonal elements as follows for 1 ≤  < , and for  > , let    =    −  where   =  ′   , then we find that: This gives us a closed form expression for the joint Gaussian distribution of the hidden states.We can then use the rules for Gaussian distributions to determine the joint distribution: From here, marginalising out hidden or missing random variables can be done by removing minors in the covariance matrix and the corresponding elements in the mean vector.For example, we have the joint distribution of the measurements: .

MTM with nonlinear specifications
Linear Gaussian dynamics are well understood and support analytic marginalization for training and predicting in the presence of missing data.However, in many cases, this formulation may prove restrictive for practical applications.For example, a linear model does not allow grey matter deterioration to accelerate at any point during the progression of dementia, nor does it allow for some biomarkers like β-amyloid to stop accumulating after reaching maximum saturation level in the brain.Modeling with nonlinear dynamics allow us to capture such non-linear relationships.In the first instance, we adopt a hybrid framework that maintains the linear Gaussian state model but learns nonlinear class-conditional measurement models where  is a k-NN regressor on training data.Our choice of hard assignment in the E step of EM makes learning the cluster-specific dynamics in the M step straightforward.In the reported results, we select k, the number of neighbors, with cross-validation (on 5, 10, or 15 neighbors).
We report results in Fig. S3 and Table S3.

Table S2 Clinical diagnosis per cluster for MTM training folds. A.
Cluster assignment frequencies for each cross-validated training run using ADNI data.Note that while the test datasets in cross-validation partition dataset, the training datasets do not: any two runs share 8/9ths of their training data (i.e.training data across folds are not independent).See TableS4Acolumn I for comparison.B.

Table S3 Clinical diagnosis per cluster for MTM with nonlinear specifications on ADNI. A.
Cluster assignment frequencies and per-cluster clinical outcome frequencies for 4-cluster MTM with linear Gaussian state dynamics and nonlinear, Gaussian measurement model.B.Same results for 4-cluster MTM with a nonlinear, Gaussian state model and measurement model.(SeealsoFig.S3.)

Table S4 Cluster assignment for 4-cluster MTM trained and tested on ADNI. A.
Cluster assignment frequencies from cross-validated predictions on: I. complete trajectories, II.trajectories of cognitive data alone, III.all features from the first assessment (baseline), IV. cognitive scores from the first assessment, V. all scores from the final assessment (this is the assessment to which clinical labels are associated), and VI.Cognitive scores from the final assessment.B. Clinical outcome frequencies per cluster.Columns as in A.

Table S5 Relating
MTM to GMM. A. Cluster assignment frequencies using I. MTM tested on trajectories, II.MTM tested on baseline data alone, III.GMM model tested on baseline ADNI data.Note that while II.andIII.are tested on baseline feature data, MTM (II) is trained using longitudinal data, while GMM (III) is trained on baseline data alone.B. Clinical outcome frequencies per cluster.Columns as in A. (For both A and B, columns I & II match columns I & III in TableS4, respectively.)

Table S6 Cluster assignment for 3-cluster MTM tested on MACC
. A. Cluster assignment frequencies using a model trained on ADNI to predict on MACC: I. complete trajectories, II.trajectories of cognitive data only, and III.all features from the final assessment.B. Clinical outcome frequencies per cluster.Columns as in A.